
Recover squash-merged PR commits #30

Merged
cvetty merged 7 commits into main from feature/squash-merge-recovery
Jun 11, 2026

@cvetty cvetty commented Jun 11, 2026

Summary

Squash-merging a PR collapses its commits into a single commit on the default branch, erasing the per-commit history WhyGraph relies on for evidence and rationale. This PR recovers that lost context — surfacing the original PR commit titles and review comments, and re-attributing squashed work back to its true PR origin on a per-line basis.

Changes

  • Stage 0 — evidence/rationale surfacing: PR commit_titles and review comments now flow into evidence and rationale cards (mcp/evidence.py, analyze/rationale.py, analyze/rationale_generator.py).
  • Stage 1 — squash recovery: new commit.on_default_branch discriminator (with Alembic migration) plus a scan-time pr_origin_enricher that recovers squash-merged PR commits (scan/pr_origin_enricher.py, db/models/commit.py, cli/commands/scan.py).
  • Stage 2 — per-line attribution: re-blame against PR origin to attribute individual lines of a squashed merge back to their originating PR (mcp/evidence.py, mcp/path_history.py).
  • Git plumbing: new repository/command helpers for fetching PR metadata (services/git/commands.py, services/git/repository.py).
  • Config: new toggles wired through core/config.py and whygraph.example.toml.
  • Test coverage across the enricher, evidence PR-origin path, config, git fetch metadata, and DB plumbing.

cvetty added 6 commits June 11, 2026 16:25
…/rationale

A squash merge collapses a feature branch's commits into one commit on the
default branch. WhyGraph already ingests the lost narrative (commit_titles,
comments on the PullRequest row) but dropped both at the final serialization
step. Stage 0 stops dropping them — no schema change.

- mcp/evidence.py: _pr_dict now emits commit_titles + comments (uncapped — the
  evidence tool's consumer is an agent that handles the full lists).
- analyze/rationale_generator.py: _format_pr renders a "Squashed commits" roster
  and a "Discussion" block, clipped by _PrRenderCaps. Caps are threaded purely
  from RationaleConfig via the generator (no get_config reach-in in the
  module-level formatters).
- core/config.py + whygraph.example.toml: three [rationale] rendering caps
  (pr_roster_max_commits=30, pr_discussion_max_comments=20,
  pr_comment_max_chars=500) with the same >=1 validation as max_diff_chars.
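
The clipping described above can be sketched as follows. This is a minimal illustration, not the code in analyze/rationale_generator.py: `PrRenderCaps` and `format_pr` are hypothetical stand-ins for `_PrRenderCaps` and `_format_pr`, the field names are shortened, and commit titles and comments are assumed to be plain strings. Only the default values and the >=1 rule come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrRenderCaps:
    """Hypothetical stand-in for _PrRenderCaps; defaults mirror the commit message."""

    roster_max_commits: int = 30
    discussion_max_comments: int = 20
    comment_max_chars: int = 500

    def __post_init__(self) -> None:
        # Same >=1 rule the commit message describes for the config values.
        for name in ("roster_max_commits", "discussion_max_comments", "comment_max_chars"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")


def format_pr(commit_titles: list[str], comments: list[str], caps: PrRenderCaps) -> str:
    """Render a clipped commit roster and discussion block (output layout is assumed)."""
    lines = ["Squashed commits:"]
    lines += [f"- {title}" for title in commit_titles[: caps.roster_max_commits]]
    hidden = len(commit_titles) - caps.roster_max_commits
    if hidden > 0:
        # Tell the reader that the roster was truncated.
        lines.append(f"- (+{hidden} more)")
    lines.append("Discussion:")
    for body in comments[: caps.discussion_max_comments]:
        # Each comment is clipped independently to the character cap.
        lines.append(f"> {body[: caps.comment_max_chars]}")
    return "\n".join(lines)
```

Passing the caps in as an argument, rather than reading global config inside the formatter, matches the "no get_config reach-in" note above and keeps the formatter easy to unit-test.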

Step 2 of the squash-merge recovery plan. Adds an int 0/1 on_default_branch
column to commit (default 1, server_default text("1")) so PR-origin commits
recovered from squash-merged PRs can be flagged 0 and kept out of the
main-walk-only queries (area-history, refactor-walk).

Migration uses a plain op.add_column (native ALTER TABLE ADD COLUMN) rather
than a batch recreate: recreating commit would trip the commit_file_change
foreign key on a populated DB. Additive + server-default backfilled, so
existing rows become on_default_branch=1 and re-scans stay safe.

No new table or pr_id column — the PR<->commit link stays the existing
commit_titles/_linked_prs path (plan 4.3).
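
The additive-column behaviour can be checked in isolation. The sketch below uses the standard-library sqlite3 module rather than Alembic and assumes a SQLite backend; the table layouts are reduced to hypothetical minimal columns. It shows a native ALTER TABLE ADD COLUMN with a default backfilling existing rows while the child table's foreign key stays untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Reduced, hypothetical versions of the two tables named in the commit message.
# "commit" is an SQL keyword, so the identifier is quoted.
conn.execute('CREATE TABLE "commit" (sha TEXT PRIMARY KEY)')
conn.execute(
    "CREATE TABLE commit_file_change ("
    " id INTEGER PRIMARY KEY,"
    ' commit_sha TEXT NOT NULL REFERENCES "commit"(sha),'
    " path TEXT NOT NULL)"
)
conn.execute('INSERT INTO "commit" (sha) VALUES (?)', ("aaa111",))
conn.execute(
    "INSERT INTO commit_file_change (commit_sha, path) VALUES (?, ?)",
    ("aaa111", "README.md"),
)

# Additive change: no table recreate, so the populated child table is not disturbed.
conn.execute('ALTER TABLE "commit" ADD COLUMN on_default_branch INTEGER DEFAULT 1')

# A recovered PR-origin commit is stored with the flag cleared.
conn.execute('INSERT INTO "commit" (sha, on_default_branch) VALUES (?, 0)', ("bbb222",))

# The pre-existing row reads back as 1 thanks to the column default.
existing_flag = conn.execute(
    'SELECT on_default_branch FROM "commit" WHERE sha = ?', ("aaa111",)
).fetchone()[0]

# Main-walk-only queries filter on the flag and never see the origin row.
main_walk = [
    row[0]
    for row in conn.execute('SELECT sha FROM "commit" WHERE on_default_branch = 1 ORDER BY sha')
]
```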

Step 3 of the squash-merge recovery plan. When a feature PR is
squash-merged, its original feature-branch commits are fetched once during
the remote scan and persisted as on_default_branch=0 commit rows, linked to
their PR through the existing commit_titles (no link table).

- scan/pr_origin_enricher.py: new PROriginEnricher crawler. Balanced gate
  (plan 3.3/3.5) — a merged PR is enriched when its commit_titles oids are
  absent from commit (squash detection) AND (the merge commit is file-bulk
  OR it collapsed >= pr_origin_min_commits commits). One targeted batched
  git fetch carries only the gated candidates' refs/pull/N/head refspecs
  (never the refs/pull/* wildcard), pinned under refs/whygraph/pull/*.
  Best-effort: a failed fetch or unreadable oid is logged and skipped, never
  failing the scan (plan 6.6). Idempotent across re-scans.
- services/git: GitFetchRefsCmd + GitLogCommitCmd, with thin
  Repository.fetch_refs / Repository.commit_metadata (reuses Commit.from_git_log).
- cli scan: --pr-origins/--no-pr-origins flag (default on), wired as a
  phase-2 sibling of analyze; gated on a resolved GitHub client so it is
  skipped under --no-remote. Added a panel row.
- core/config: AnalyzeConfig.pr_origin_min_commits (default 5) + >=1
  validation + example toml line.
- mcp 4.10 guards: _boring_shas_in and area_history_commits now filter
  on_default_branch == 1 so recovered origin commits never leak into the
  main-walk-only queries (defensive — they carry no commit_file_change rows).
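
As a rough sketch of the balanced gate and the targeted refspecs described above: the function names and signatures below are hypothetical, not the PROriginEnricher API. Two details are assumptions: "absent" is read as "none of the PR's oids are present in commit", and the refspecs carry a leading `+` (force update).

```python
from collections.abc import Iterable


def is_squash_candidate(
    pr_oids: list[str],
    known_oids: set[str],
    merge_is_file_bulk: bool,
    min_commits: int = 5,  # mirrors the pr_origin_min_commits default
) -> bool:
    """Return True when a merged PR should have its origin commits fetched."""
    if not pr_oids:
        return False
    # Squash detection: none of the PR's commits exist in the commit table.
    squashed = known_oids.isdisjoint(pr_oids)
    # Balanced gate: file-bulk merge commit OR enough collapsed commits.
    return squashed and (merge_is_file_bulk or len(pr_oids) >= min_commits)


def candidate_refspecs(pr_numbers: Iterable[int]) -> list[str]:
    """Build one refspec per gated PR, never the refs/pull/* wildcard."""
    return [
        f"+refs/pull/{number}/head:refs/whygraph/pull/{number}"
        for number in sorted(set(pr_numbers))
    ]
```

For example, `candidate_refspecs([7, 3, 7])` yields two refspecs, for PRs 3 and 7, which could then be passed to a single batched `git fetch`.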

Tests: balanced-gate unit tests, end-to-end work() with a stubbed repo
(asserts only candidate refspecs, on_default_branch=0 rows, dedup of an oid
shared across PRs, graceful fetch-failure degrade), real-git fetch_refs/
commit_metadata tests, config default/validation, and the 4.10 guard tests.
Full suite: 489 passed.

Step 5 (final) of the squash-merge recovery plan. When a queried line
blames to a squash-merged PR's commit that Stage 1 enriched,
collect_evidence now re-blames the same range at the PR's head_sha so each
line maps back to the original feature-branch commit that authored it —
surfaced as source="pr-origin".

- mcp/evidence.py: _attribute_squash_origins re-blames at head_sha, modelled
  on _predecessor_blame (same rev=-blame call, same best-effort per-PR
  GitError swallow). _enriched_squash_prs_for gates on the correct
  §4.8 predicate: blame SHA == a PR's merge_commit_sha AND that PR has >= 1
  on_default_branch=0 origin row — gate-agnostic, so it covers both
  file-bulk and commit-rich enriched squashes. Hunks feed the existing
  labeled_hunks list, so dedupe / priority / cap machinery is unchanged.
- _SOURCE_PRIORITY: pr-origin = 0.5 (just below blame=0); a real authoring
  commit reached through the squash beats every weaker label but loses to a
  direct HEAD blame hit.
- Labels (§4.9): _SOURCE_LABELS entry (rationale_generator); CommitEvidence.
  source docstring enum (analyze/rationale); evidence.py module +
  collect_evidence docstrings updated four→five signals.
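
The gating predicate and the priority ordering above can be illustrated with a small sketch. Everything except the `blame` and `pr-origin` priorities is hypothetical: the `PullRequest` shape, the function names, the "predecessor" label and its value, and the assumption that dedupe keeps the label with the lowest priority number per commit.

```python
from dataclasses import dataclass

# Lower number wins. Only "blame" and "pr-origin" values come from the commit
# message; "predecessor" is a hypothetical weaker label added for illustration.
SOURCE_PRIORITY = {"blame": 0, "pr-origin": 0.5, "predecessor": 1}


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical reduced view of a PullRequest row."""

    number: int
    merge_commit_sha: str
    origin_row_count: int  # rows with on_default_branch=0 linked to this PR


def enriched_squash_prs_for(blame_shas: set[str], prs: list[PullRequest]) -> list[PullRequest]:
    """Keep PRs whose merge commit was blamed AND that have >= 1 origin row."""
    return [
        pr
        for pr in prs
        if pr.merge_commit_sha in blame_shas and pr.origin_row_count >= 1
    ]


def strongest_label_per_commit(labeled: list[tuple[str, str]]) -> dict[str, str]:
    """Map each commit SHA to its strongest (lowest-priority-number) source label."""
    best: dict[str, str] = {}
    for sha, source in labeled:
        current = best.get(sha)
        if current is None or SOURCE_PRIORITY[source] < SOURCE_PRIORITY[current]:
            best[sha] = source
    return best
```

Because the predicate looks only at the merge SHA and the presence of origin rows, it does not need to know which branch of the scan-time gate caused the PR to be enriched.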

Step 4 (lazy diffs for origin commits) needed no code: the enricher leaves
llm_description NULL and origin commits are normal-sized, so
backfill_evidence_descriptions already routes them through the normal /
backfill_all branch, and Repository.diff resolves now that the object is local.

Tests: real squash repo (feature branch squashed into main, tip pinned
under refs/whygraph/pull/1, branch deleted) asserting pr-origin entries
match git blame head_sha; small/non-file-bulk squash variant (§4.8);
graceful degrade on a GC'd head_sha; unenriched squash → no attribution;
source-priority unit test. Full suite: 493 passed.
@cvetty cvetty added the enhancement New feature or request label Jun 11, 2026
@cvetty cvetty self-assigned this Jun 11, 2026
Collapse over-wrapped statements onto single lines so `ruff format --check`
(run repo-wide in CI) passes. No behavior change.
@cvetty cvetty merged commit 48c87f5 into main Jun 11, 2026
2 checks passed